CHINESE WORD SEGMENTATION

BACKGROUND OF THE INVENTION 
The present invention relates generally to
the field of natural language processing. More
specifically, the present invention relates to word
segmentation.

Word segmentation refers to the process of
identifying the individual words that make up an
expression of language, such as text. Word
segmentation is useful for checking spelling and
grammar, synthesizing speech from text, and
performing natural language parsing and
understanding, all of which benefit from an
identification of individual words.

Performing word segmentation of English
text is rather straightforward, since spaces and
punctuation marks generally delimit the individual
words in the text. Consider the English sentence in
Table 1 below.

The motion was then tabled--that is, removed
indefinitely from consideration.

Table 1 

By identifying each contiguous sequence of
spaces and/or punctuation marks as the end of the
word preceding the sequence, the English sentence in
Table 1 may be straightforwardly segmented as shown
in Table 2 below.

The motion was then tabled -- that is , removed
indefinitely from consideration .

Table 2
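
The delimiter-based segmentation of Table 2 can be illustrated with a short sketch. The function name `segment_english` and the choice of a regular expression are assumptions made for illustration only; the text above does not prescribe an implementation.

```python
import re


def segment_english(text: str) -> list[str]:
    """Split English text into word and punctuation tokens.

    A word is a maximal run of letters, digits or apostrophes. Each
    maximal run of punctuation is kept as its own token, and whitespace
    is discarded.
    """
    return re.findall(r"[A-Za-z0-9']+|[^A-Za-z0-9'\s]+", text)


sentence = ("The motion was then tabled--that is, removed "
            "indefinitely from consideration.")
print(" ".join(segment_english(sentence)))
```

Applied to the sentence of Table 1, this prints the tokens separated by single spaces, in the same layout as Table 2. No comparable delimiter exists for the Chinese text discussed next.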

In Chinese text, word boundaries are
implicit rather than explicit. Consider the sentence
in Table 3 below, meaning "The committee discussed
this problem yesterday afternoon in Buenos Aires."

Table 3 

Despite the absence of punctuation and 
spaces from the sentence, a reader of Chinese would 
recognize the sentence in Table 3 as being comprised 
of the words separately underlined in Table 4 below. 

tfr s t * m a t & iiiiuij n it z&± 

Table 4 

Many methods and systems have been devised
to provide word segmentation for languages such as
Chinese and Japanese. In some systems, models are
trained based on a corpus of segmented text. The
models describe the likelihood of various segments
appearing in a text string and provide an output
indicative thereof. Developing a corpus to train the
models takes time and expense. In many instances, the
quality of the output of an associated word
segmentation system depends largely upon the quality
of the corpus used to train the model. As a result, a
method for evaluating corpora and developing corpora
will aid in providing quality word segmentation.

SUMMARY OF THE INVENTION 
The present invention relates to a corpus
for use in training a language model. The corpus
includes a plurality of characters and a plurality of
morphological tags associated with a plurality of
sequences of characters. The plurality of
morphological tags indicate a morphological type of
an associated sequence of characters and a
combination of parts forming a morphological subtype.

In another aspect, a computer readable
medium having instructions for performing word
segmentation is provided. The instructions include
receiving an input of unsegmented text and accessing
a language model to determine a segmentation of the
text. A morphologically derived word is detected in
the text and an output indicative of segmented text
and an indication of a combination of parts that form
the morphologically derived word is provided.

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 is a block diagram of a general
computing environment in which the present invention
can be useful.

FIG. 2 is a block diagram of a language
processing system.

FIG. 3 is a flow diagram of a method for
developing an annotated corpus.

FIG. 4 is a flow diagram for creating a
language model and evaluating the performance of the
language model.

FIG. 5 is a block diagram of types and
subtypes of morphologically derived words.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS 

Prior to discussing the present invention
in greater detail, an embodiment of an illustrative
environment in which the present invention can be
used will be discussed. FIG. 1 illustrates an example
of a suitable computing system environment 100 on
which the invention may be implemented. The
computing system environment 100 is only one example
of a suitable computing environment and is not
intended to suggest any limitation as to the scope of
use or functionality of the invention. Neither should
the computing environment 100 be interpreted as
having any dependency or requirement relating to any
one or combination of components illustrated in the
exemplary operating environment 100.

The invention is operational with numerous
other general purpose or special purpose computing
system environments or configurations. Examples of
well known computing systems, environments, and/or
configurations that may be suitable for use with the
invention include, but are not limited to, personal
computers, server computers, hand-held or laptop
devices, multiprocessor systems, microprocessor-based
systems, set top boxes, programmable consumer
electronics, network PCs, minicomputers, mainframe
computers, distributed computing environments that
include any of the above systems or devices, and the
like.

The invention may be described in the
general context of computer-executable instructions,
such as program modules, being executed by a
computer. Generally, program modules include
routines, programs, objects, components, data
structures, etc. that perform particular tasks or
implement particular abstract data types. Those
skilled in the art can implement the description
and/or figures herein as computer-executable
instructions, which can be embodied on any form of
computer readable media discussed below.

The invention may also be practiced in
distributed computing environments where tasks are
performed by remote processing devices that are
linked through a communications network. In a
distributed computing environment, program modules
may be located in both local and remote computer
storage media including memory storage devices.

With reference to FIG. 1, an exemplary
system for implementing the invention includes a
general purpose computing device in the form of a
computer 110. Components of computer 110 may
include, but are not limited to, a processing unit
120, a system memory 130, and a system bus 121 that
couples various system components including the
system memory to the processing unit 120. The system
bus 121 may be any of several types of bus structures
including a memory bus or memory controller, a
peripheral bus, and a local bus using any of a
variety of bus architectures. By way of example, and
not limitation, such architectures include Industry
Standard Architecture (ISA) bus, Micro Channel
Architecture (MCA) bus, Enhanced ISA (EISA) bus,
Video Electronics Standards Association (VESA) local
bus, and Peripheral Component Interconnect (PCI) bus
also known as Mezzanine bus.

Computer 110 typically includes a variety
of computer readable media. Computer readable media
can be any available media that can be accessed by
computer 110 and includes both volatile and
nonvolatile media, removable and non-removable media.
By way of example, and not limitation, computer
readable media may comprise computer storage media
and communication media. Computer storage media
includes both volatile and nonvolatile, removable and
non-removable media implemented in any method or
technology for storage of information such as
computer readable instructions, data structures,
program modules or other data. Computer storage
media includes, but is not limited to, RAM, ROM,
EEPROM, flash memory or other memory technology,
CD-ROM, digital versatile disks (DVD) or other optical
disk storage, magnetic cassettes, magnetic tape,
magnetic disk storage or other magnetic storage
devices, or any other medium which can be used to
store the desired information and which can be
accessed by computer 110. Communication media
typically embodies computer readable instructions,
data structures, program modules or other data in a
modulated data signal such as a carrier wave or other
transport mechanism and includes any information
delivery media. The term "modulated data signal"
means a signal that has one or more of its
characteristics set or changed in such a manner as to
encode information in the signal. By way of example,
and not limitation, communication media includes
wired media such as a wired network or direct-wired
connection, and wireless media such as acoustic, RF,
infrared and other wireless media. Combinations of
any of the above should also be included within the
scope of computer readable media.

The system memory 130 includes computer
storage media in the form of volatile and/or
nonvolatile memory such as read only memory (ROM) 131
and random access memory (RAM) 132. A basic
input/output system 133 (BIOS), containing the basic
routines that help to transfer information between
elements within computer 110, such as during
start-up, is typically stored in ROM 131. RAM 132 typically
contains data and/or program modules that are
immediately accessible to and/or presently being
operated on by processing unit 120. By way of
example, and not limitation, FIG. 1 illustrates
operating system 134, application programs 135, other
program modules 136, and program data 137.

The computer 110 may also include other
removable/non-removable volatile/nonvolatile computer
storage media. By way of example only, FIG. 1
illustrates a hard disk drive 141 that reads from or
writes to non-removable, nonvolatile magnetic media,
a magnetic disk drive 151 that reads from or writes
to a removable, nonvolatile magnetic disk 152, and an
optical disk drive 155 that reads from or writes to a
removable, nonvolatile optical disk 156 such as a CD
ROM or other optical media. Other
removable/non-removable, volatile/nonvolatile computer storage
media that can be used in the exemplary operating
environment include, but are not limited to, magnetic
tape cassettes, flash memory cards, digital versatile
disks, digital video tape, solid state RAM, solid
state ROM, and the like. The hard disk drive 141 is
typically connected to the system bus 121 through a
non-removable memory interface such as interface 140,
and magnetic disk drive 151 and optical disk drive
155 are typically connected to the system bus 121 by
a removable memory interface, such as interface 150.

The drives and their associated computer
storage media discussed above and illustrated in FIG.
1 provide storage of computer readable instructions,
data structures, program modules and other data for
the computer 110. In FIG. 1, for example, hard disk
drive 141 is illustrated as storing operating system
144, application programs 145, other program modules
146, and program data 147. Note that these components
can either be the same as or different from operating
system 134, application programs 135, other program
modules 136, and program data 137. Operating system
144, application programs 145, other program modules
146, and program data 147 are given different numbers
here to illustrate that, at a minimum, they are
different copies.

A user may enter commands and information
into the computer 110 through input devices such as a
keyboard 162, a microphone 163, and a pointing device
161, such as a mouse, trackball or touch pad. Other
input devices (not shown) may include a joystick,
game pad, satellite dish, scanner, or the like.
These and other input devices are often connected to
the processing unit 120 through a user input
interface 160 that is coupled to the system bus, but
may be connected by other interface and bus
structures, such as a parallel port, game port or a
universal serial bus (USB). A monitor 191 or other
type of display device is also connected to the
system bus 121 via an interface, such as a video
interface 190. In addition to the monitor, computers
may also include other peripheral output devices such
as speakers 197 and printer 196, which may be
connected through an output peripheral interface 195.

The computer 110 may operate in a networked
environment using logical connections to one or more
remote computers, such as a remote computer 180. The
remote computer 180 may be a personal computer, a
hand-held device, a server, a router, a network PC, a
peer device or other common network node, and
typically includes many or all of the elements
described above relative to the computer 110. The
logical connections depicted in FIG. 1 include a
local area network (LAN) 171 and a wide area network
(WAN) 173, but may also include other networks. Such
networking environments are commonplace in offices,
enterprise-wide computer networks, intranets and the
Internet.

When used in a LAN networking environment,
the computer 110 is connected to the LAN 171 through
a network interface or adapter 170. When used in a
WAN networking environment, the computer 110
typically includes a modem 172 or other means for
establishing communications over the WAN 173, such as
the Internet. The modem 172, which may be internal
or external, may be connected to the system bus 121
via the user-input interface 160, or other
appropriate mechanism. In a networked environment,
program modules depicted relative to the computer
110, or portions thereof, may be stored in the remote
memory storage device. By way of example, and not
limitation, FIG. 1 illustrates remote application
programs 185 as residing on remote computer 180. It
will be appreciated that the network connections
shown are exemplary and other means of establishing a
communications link between the computers may be
used.

FIG. 2 generally illustrates a language
processing system 200 that receives a language input
202 to provide a language output 204. For example,
the language processing system 200 can be embodied as
a word segmentation system or module that receives as
language input 202 unsegmented text. The language
processing system 200 processes the unsegmented text
and provides an output 204 indicative of segmented
text and accompanying information related to the
segmented text.

During processing, the language processing
system 200 can access a language model 206 in order
to determine a segmentation for the input text 202.
Language model 206 can be constructed from an
annotated corpus that defines various types of words
as well as an indication of the specific type. As
appreciated by those skilled in the art, language
processing system 200 can be useful in various
situations such as spell checking, grammar checking,
synthesizing speech from text, speech recognition,
information retrieval and performing natural language
parsing and understanding, to name a few.
Additionally, language model 206 may be developed
based on the particular application for which
language processing system 200 is used.

In addition to providing segmentation,
system 200 also provides an indication of word type
for each of the segmented words. In one embodiment,
Chinese words are defined as one of the following
four types: (1) entries in a given lexicon (lexicon
words or LWs hereafter), (2) morphologically derived
words (MDWs), (3) factoids such as Date, Time,
Percentage, Money, etc., and (4) named entities (NEs)
such as person names (PNs), location names (LNs), and
organization names (ONs). Various subtypes can also
be defined. Given the definitions of these types of
words, system 200 can provide an output indicative of
segmentation and word type. For example, consider the
unsegmented sentence in Table 5 below, meaning
"Friends happily go to Professor Li Junsheng's home
for lunch at twelve thirty."

Table 5 

An exemplary output of system 200 is shown
in Table 6 below. Square brackets indicate word
boundaries and a "+" indicates a morpheme boundary.
Tags are provided within the brackets to indicate the
various types and subtypes of words within the
sentence.

[朋友+们 MA_S] [十二点三十分 12:30 TIME] [高高兴兴 MR_AABB]
[到] [李俊生 PN] [教授] [家] [吃饭]

Table 6

In order to provide segmentation, language
model 206 detects word types in the input text 202.
For lexicon words, word boundaries are detected if
the word is contained in the lexicon. For
morphologically derived words, morphological patterns
are detected, e.g. [朋友+们] (which means friend+s) is
derived by affixation of the plural affix 们 to the
noun 朋友 (MA_S is a tag that indicates a suffixation
pattern), and 高高兴兴 (which means happily) is a
reduplication of 高兴 (happy) (MR_AABB is a tag that
indicates an AABB reduplication pattern).

In the case of factoids, their types and
normalized forms are detected, e.g. 12:30 is the
normalized form of the time expression 十二点三十分
(TIME is a tag that indicates a time expression). For
named entities, subtypes are detected, e.g. 李俊生 (Li
Junsheng) is a person name (PN is a tag that
indicates a person name).
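
One possible in-memory representation of the tagged output described above is sketched below. The `Word` class, its field names and the `render` function are hypothetical, and English glosses stand in for the Chinese surface strings. The sketch only shows how a surface form, an optional normalized form and an optional tag could be combined into the bracket notation of Table 6.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    surface: str                      # word text; "+" marks a morpheme boundary
    tag: Optional[str] = None         # None for a plain lexicon word
    normalized: Optional[str] = None  # normalized form, used for factoids


def render(words: list[Word]) -> str:
    """Render words as bracketed units: [surface normalized tag]."""
    units = []
    for word in words:
        fields = [word.surface]
        if word.normalized is not None:
            fields.append(word.normalized)
        if word.tag is not None:
            fields.append(word.tag)
        units.append("[" + " ".join(fields) + "]")
    return " ".join(units)


example = [
    Word("friend+s", tag="MA_S"),                         # suffixation
    Word("twelve-thirty", tag="TIME", normalized="12:30"),  # factoid
    Word("happily", tag="MR_AABB"),                       # reduplication
    Word("to"),                                           # lexicon word
    Word("Li-Junsheng", tag="PN"),                        # person name
]
print(render(example))
```

Lexicon words carry no tag, so they render as a bare bracketed string, in line with the description of Table 6.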

Language model 206 can be created from an
annotated corpus. FIG. 3 illustrates a method 250 for
developing an annotated corpus that is to be used for
creating language models for word segmentation
systems, such as language model 206 of system 200. At
step 252, words and rules pertaining to word
segmentation are defined. For example, a lexicon for
Chinese word segmentation, a rule set for Chinese
morphologically derived words, a guideline of Chinese
factoids and named entities and/or combinations
thereof may be defined for developing the annotated
corpus. At step 254, an extensive corpus is provided
that includes a large amount of text as well as a
large variety of text. The extensive corpus may be
chosen from various text sources such as newspapers
and magazines. Next, at step 256, a list that matches
the words and rules defined in step 252 is extracted
from the extensive corpus to create a list of
potential words.
At step 258, the extracted list can be
manually checked if desired to filter out any noise
or errors within the list. It is then determined
whether the list has sufficient coverage of the
defined words and rules at step 260. In one
embodiment, the list may be compared to a balanced,
independent test corpus having a wide variety of
domains and styles. For example, the domains and
styles may include text related to culture, economy,
literature, military, politics, science and
technology, society, sports, computers and law, to
name a few. Alternatively, an application specific
corpus may be used having broad coverage of a
particular application. If it is determined that the
list has sufficient coverage, the corpus is then
tagged at step 262. The tagging of the corpus can be
performed as discussed below. At step 264, the tagged
corpus can be checked and any errors may be
corrected. At step 266, the resulting corpus is used
as a seed corpus to tag a larger amount of text as a
training or testing corpus. As a result, an annotated
corpus is developed that can be evaluated using
method 280 in FIG. 4.
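
Steps 256 and 260 can be sketched as follows. This is an illustration under stated assumptions rather than the method of FIG. 3 itself: candidates are found by plain substring lookup against a lexicon only, coverage is taken to be the fraction of lexicon entries occurring in the independent test corpus that also appear in the extracted list, and the names `extract_candidates`, `coverage` and `max_len` are hypothetical.

```python
def extract_candidates(lines, lexicon, max_len=4):
    """Step 256 (sketch): collect every lexicon entry found in the corpus."""
    found = set()
    for line in lines:
        for start in range(len(line)):
            # Try every substring of up to max_len characters.
            for length in range(1, min(max_len, len(line) - start) + 1):
                candidate = line[start:start + length]
                if candidate in lexicon:
                    found.add(candidate)
    return found


def coverage(candidates, test_lines, lexicon, max_len=4):
    """Step 260 (sketch): share of test-corpus entries present in the list."""
    needed = extract_candidates(test_lines, lexicon, max_len)
    if not needed:
        return 1.0
    return len(needed & candidates) / len(needed)


# Toy data: Latin letters stand in for Chinese characters.
lexicon = {"ab", "cd", "e"}
candidates = extract_candidates(["abcd"], lexicon)
print(sorted(candidates), coverage(candidates, ["abe"], lexicon))
```

What counts as sufficient coverage is left open by the description, so any threshold applied to the returned ratio would be a separate design decision.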

FIG. 4 illustrates a method 280 for
creating and evaluating a language model 206 in order
to provide improved word segmentation. At step 282,
an annotated corpus is developed, the process of
which is described above with respect to FIG. 3.
Given the annotated corpus, a training or testing
model is created based on the annotated corpus at
step 284. At step 286, the model created is evaluated
by comparing the model to a predefined test corpus or
other models. Given the evaluation performed in step
286, the effectiveness of language model 206 can be
determined.

In order to evaluate a language model, the
output of a word segmentation system using the model
can be compared to a standard annotated testing
corpus that serves as a standard output of a
segmentation system. To achieve a reliable
evaluation, a raw (unannotated) test corpus may be
chosen that is independent, balanced and of
appropriate size. An independent test corpus will
have a relatively small overlap with the annotated
corpus used to train the language model. A balanced
corpus contains documents covering a wide variety of
domains, styles and time periods. In order to be large enough,
one embodiment of a test corpus includes
approximately one million Chinese characters. After
developing the test corpus, the corpus is manually
annotated to be used as a standard output of a
Chinese word segmentation system given the test
corpus. The test corpus can be annotated using the
tagging specification described below or another
tagging specification.

Given the annotated test corpus, a
quantitative evaluation can be used to evaluate the
performance of a language model. If the total number
of word tokens in the standard test set is "S", the
total number of word tokens in the output of the word
segmentation system being evaluated on the
test set is "E", and the number of word tokens in the
output that exactly match the word tokens in the
standard test set is "M", quantitative values can be
calculated to evaluate performance of the language
model. Equations 1-3 below show values for precision,
recall and an F-score.

Precision = M / E    (1)
Recall = M / S    (2)
F = 2 x Precision x Recall / (Precision + Recall)    (3)
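
Equations 1-3 can be computed directly from two segmentations of the same text, as in the sketch below. It assumes that "exactly matched" means a word token occupying the same character span in both segmentations. The function names `to_spans` and `evaluate` are illustrative.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(standard, output):
    """Return (precision, recall, F) per equations 1-3."""
    standard_spans, output_spans = to_spans(standard), to_spans(output)
    s = len(standard_spans)                 # S: tokens in the standard set
    e = len(output_spans)                   # E: tokens in the system output
    m = len(standard_spans & output_spans)  # M: exactly matching tokens
    precision = m / e if e else 0.0
    recall = m / s if s else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: Latin letters stand in for Chinese characters.
print(evaluate(["ab", "c", "de"], ["ab", "cd", "e"]))
```

In the toy example only the first token agrees, so S = 3, E = 3 and M = 1, and all three values equal one third.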

Furthermore, the evaluation may be
performed on various subtypes according to equations
1-3 above. For example, a person name performance
evaluation may be conducted where S_PN is the total
number of person name tokens in the standard test
corpus, E_PN is the total number of person name tokens
in the output of the word segmentation system being
evaluated, and M_PN is the number of person name
tokens in the output that exactly match the person
names in the standard test set. As a result, the
performance equations are:

Precision_PN = M_PN / E_PN    (4)
Recall_PN = M_PN / S_PN    (5)
F_PN = 2 x Precision_PN x Recall_PN / (Precision_PN + Recall_PN)    (6)
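
The per-type evaluation of equations 4-6 can be sketched by restricting the comparison to tokens carrying one tag. The representation of a token as a `(start, end, tag)` tuple and the function name `evaluate_type` are assumptions made for illustration.

```python
def evaluate_type(standard, output, tag):
    """Return (precision, recall, F) for tokens of one type, e.g. "PN".

    Each token is a (start, end, tag) tuple. A match requires the same
    character span and the same tag in both sets.
    """
    standard_typed = {token for token in standard if token[2] == tag}
    output_typed = {token for token in output if token[2] == tag}
    m = len(standard_typed & output_typed)  # M_PN when tag == "PN"
    precision = m / len(output_typed) if output_typed else 0.0
    recall = m / len(standard_typed) if standard_typed else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


standard = {(0, 3, "PN"), (5, 7, "PN"), (8, 10, "LN")}
output = {(0, 3, "PN"), (5, 6, "PN"), (8, 10, "LN"), (11, 13, "PN")}
print(evaluate_type(standard, output, "PN"))
```

Here S_PN = 2, E_PN = 3 and M_PN = 1, giving a precision of 1/3, a recall of 1/2 and an F-score of 0.4. The correctly matched location name does not affect the person name figures.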

It is further useful to compare the results of other
systems when evaluating the performance of language
models. For example, it may be useful to compare only
selected portions of the outputs of different word
segmentation systems, such as (1) person names, (2)
location names, (3) organization names, (4)
overlapping ambiguous strings and (5) covering
ambiguous strings. Evaluating only a subset of the
output of the segmentation systems gives a better idea of
where errors are occurring in segmentation.

In order to develop annotated corpora, a
tagging specification is used to consistently tag the
corpora given the definitions of Chinese word types
described above. Lexicon words within the lexicon are
delimited by brackets without additional tagging.
Other types are tagged as provided below.

FIG. 5 illustrates a diagram of
morphological categories for tagging corpora. The
morphological categories include affixation,
reduplication, split, merge and head particle. Each
morphological category or type includes various
subtypes that can be tagged during the tagging
process. The format in FIG. 5 shows the category, the
parts that make the word and the resultant part of
speech of the word. In the diagram of FIG. 5, "MP"
stands for morphological prefix and "MS" stands for
morphological suffix. "MR" is a reduplication, "ML" a
split, "MM" denotes a merge and "MHP" is a
morphological head particle. The part between the
underscore (_) and the hyphen (-) is the combination of
parts that form the morphologically derived word. For
reduplication and merge, the characters A, B and C
represent Chinese characters.
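
A tag in this format can be split into its components as sketched below. The function name `parse_tag` and the returned dictionary keys are hypothetical. The category abbreviations are those listed for FIG. 5, the example tag "MR_AABB-v" is invented for illustration, and the part-of-speech suffix is treated as optional because tags such as MR_AABB appear without one.

```python
CATEGORIES = {
    "MP": "morphological prefix",
    "MS": "morphological suffix",
    "MR": "reduplication",
    "ML": "split",
    "MM": "merge",
    "MHP": "morphological head particle",
}


def parse_tag(tag: str) -> dict:
    """Split a tag of the form CATEGORY_PARTS[-POS] into its fields."""
    head, underscore, rest = tag.partition("_")
    if not underscore or head not in CATEGORIES:
        raise ValueError(f"not a recognized morphological tag: {tag!r}")
    # The text between "_" and "-" is the combination of parts.
    parts, _, pos = rest.partition("-")
    return {
        "category": CATEGORIES[head],
        "parts": parts,
        "pos": pos or None,  # resultant part of speech, if present
    }


print(parse_tag("MR_AABB"))
print(parse_tag("MR_AABB-v"))
```

Abbreviations outside the list above raise a `ValueError`, so a deployment with a different tag inventory would need to extend `CATEGORIES`.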
The format in FIG. 5 represents
morphological variations and it will be appreciated
that other formats of tagging may be used to
represent the variations. Affixation includes
subcategories prefix and suffix where a character is
added to a string of other characters to
morphologically change the word represented by the
original characters. Prefixes include seven subtypes
and suffixes include thirteen subtypes. Reduplication
occurs where the original word that consists of a
pattern of characters is converted into another word
consisting of a combination of characters and
includes thirty different subtypes. In the reduplication
subtypes, "V" represents a verb, "O" represents
an object, and "l", "le" and "liaozhi" are particles.
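
Checking a word against a letter pattern such as AABB can be sketched as follows. The sketch covers only patterns made of placeholder letters (A, B, C) and does not handle subtypes that involve the verb, object or particle symbols. The function name `matches_pattern` is hypothetical, and the example word 高高兴兴 ("happily", from 高兴 "happy") is a standard AABB reduplication used here for illustration.

```python
def matches_pattern(word: str, pattern: str) -> bool:
    """Return True if word instantiates the letter pattern.

    Each pattern letter must bind to one character consistently, and
    different letters must bind to different characters.
    """
    if len(word) != len(pattern):
        return False
    binding = {}
    for character, letter in zip(word, pattern):
        # setdefault records the first binding and returns any earlier one.
        if binding.setdefault(letter, character) != character:
            return False
    return len(set(binding.values())) == len(binding)


print(matches_pattern("高高兴兴", "AABB"))
print(matches_pattern("高高兴兴", "ABAB"))
print(matches_pattern("高兴高兴", "ABAB"))
```

The first and third calls print True and the second prints False, since 高高兴兴 repeats each character in place rather than repeating the two-character word.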

Split includes a set of expressions that
are separate words at the syntactic level but single
words at the semantic level. For example, a character
string ABC may represent the phrase "already ate",
where the bi-character word AC represents the word
"ate" and is split by the particle character B
representing the word "already". Split includes two
subtypes. One subtype involves inserting a character
or characters between a verb and an object and the
other inserts an object within the phrase "qilai".
Merging occurs where one word consisting of two
characters and another word consisting of two
characters are combined to form a single word and
includes three subtypes. A head particle occurs when
combining a verb character with other characters to
form a word and includes two subtypes that combine an
adjective and a direction and a verb and a direction.

The tagging format for named entities and
factoids is presented in Table 7 below. Format-1
includes simple tags for various types and subtypes
to help facilitate quick and easy tagging by a human.
For example, the named entities for person, location
and organization are simply tagged as P, L and O,
respectively. Format-2 represents tagging using the
Standard Generalized Markup Language (SGML)
according to the Second Multilingual Entity Task
Evaluation (MET-2). If desired, a transformation
between format-1 and format-2 can be realized through
a suitable transformation program.



Main Category   Subcategory      Format-1      Format-2
                                 tagging set   tagging set

PERSON          PERSON           P             PERSON
LOCATION        LOCATION         L             LOCATION
ORGANIZATION    ORGANIZATION     O             ORGANIZATION
TIMEX           Date             dat           DATE
                Duration         dur           DURATION
                Time             tim           TIME
NUMEX           Percent          per           PERCENT
                Money            mon           MONEY
                Frequency        fre           FREQUENCY
                Integer          int           INTEGER
                Fraction         fra           FRACTION
                Decimal          dec           DECIMAL
                Ordinal          ord           ORDINAL
                Rate             rat           RATE
MEASUREX        Age              age           AGE
                Weight           wei           WEIGHT
                Length           len           LENGTH
                Temperature      tem           TEMPERATURE
                Angle            ang           ANGLE
                Area             are           AREA
                Capacity         cap           CAPACITY
                Speed            spe           SPEED
                Other measures   mea           MEASURE
ADDRESSX        Email            ema           EMAIL
                Phone            pho           PHONE
                Fax              fax           FAX
                Telex            tel           TELEX
                WWW              www           WWW

Table 7 



Given the tagging format in Table 7, named
entities and factoids within corpora can be easily
tagged to provide annotated corpora. An example of
tagging in format-1 and format-2 is provided below.

The tagging format of format-1:

e.g.: on the morning of October 9th --> on the [tim
morning] of [dat October 9th]

The tagging format of format-2:

e.g.: on the morning of October 9th --> on the
<TIMEX TYPE=TIME>morning</TIMEX> of <TIMEX
TYPE=DATE>October 9th</TIMEX>
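
A transformation program from format-1 to format-2 can be sketched with a regular expression substitution. The mapping below follows Table 7, with the main category used as the element name for factoids. Using an ENAMEX element for the P, L and O tags is an assumption based on general MET-2 practice and is not stated in the text above. The function name `format1_to_format2` is hypothetical, and nested brackets are not handled.

```python
import re

# Main category -> {format-1 tag: format-2 TYPE value}, following Table 7.
# ENAMEX for named entities is an assumption, not taken from the text.
_BY_CATEGORY = {
    "ENAMEX": {"P": "PERSON", "L": "LOCATION", "O": "ORGANIZATION"},
    "TIMEX": {"dat": "DATE", "dur": "DURATION", "tim": "TIME"},
    "NUMEX": {"per": "PERCENT", "mon": "MONEY", "fre": "FREQUENCY",
              "int": "INTEGER", "fra": "FRACTION", "dec": "DECIMAL",
              "ord": "ORDINAL", "rat": "RATE"},
    "MEASUREX": {"age": "AGE", "wei": "WEIGHT", "len": "LENGTH",
                 "tem": "TEMPERATURE", "ang": "ANGLE", "are": "AREA",
                 "cap": "CAPACITY", "spe": "SPEED", "mea": "MEASURE"},
    "ADDRESSX": {"ema": "EMAIL", "pho": "PHONE", "fax": "FAX",
                 "tel": "TELEX", "www": "WWW"},
}
FORMAT1_TO_FORMAT2 = {
    tag: (element, type_value)
    for element, tags in _BY_CATEGORY.items()
    for tag, type_value in tags.items()
}

# A tagged unit is "[tag content]" with no nested brackets.
_TAGGED = re.compile(r"\[(\w+)\s+([^\[\]]+)\]")


def format1_to_format2(text: str) -> str:
    """Rewrite every recognized format-1 unit as a format-2 element."""
    def convert(match):
        tag, content = match.group(1), match.group(2)
        if tag not in FORMAT1_TO_FORMAT2:
            return match.group(0)  # leave unknown or untagged units alone
        element, type_value = FORMAT1_TO_FORMAT2[tag]
        return f"<{element} TYPE={type_value}>{content.strip()}</{element}>"
    return _TAGGED.sub(convert, text)


print(format1_to_format2("on the [tim morning] of [dat October 9th]"))
```

Bracketed lexicon words without a tag, such as "[word]", pass through unchanged because the pattern requires a tag followed by whitespace.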

It is useful to provide general guidelines
when tagging corpora to ensure consistency and
accuracy. The following description provides these
guidelines.





General Guidelines 
(1) Placing an "Enter" in original (raw) text to make
a new line should be avoided.

(2) A tag that is marked with "-ms" is described
below. An example is [P-ms Deng Xiaoping] theory.

(3) A string is allowed to have multi-tagging. If the
annotators do not have enough information to decide
the mono-tagging for such strings, then "/" is
introduced for a multi-tagging.

[L/o ws:es£|j*>i>]

(4) OPT: In the case that the annotators are not sure
whether some strings are to be tagged or not, then
the mark OPT is introduced to mean that this tagging
is open to discussion.

[P/OPT

Guidelines that pertain to all Named Entities
(Person, Location, Organization)

1. Proper Nouns are those NEs with objective and
specific meanings, while the NEs with abstract and
general meanings are not included.

E.g.: The expressions '^^ Foreigner' and 'M££ girl' are
not Proper Nouns.

2. For a complex Proper Noun, embedded tagging is 
not allowed. That is to say the maximum matching 
approach is used where the segmented word having the 
greatest number of characters is used. 
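
The maximum matching rule can be sketched as a greedy longest-match tagger. This is an illustration under stated assumptions: the entity dictionary, the function name `tag_longest` and the use of format-1 style brackets in the output are hypothetical, and an English string stands in for Chinese text.

```python
def tag_longest(text: str, entities: dict) -> str:
    """Tag entities left to right, always taking the longest match.

    entities maps an entity string to its format-1 tag. Because the
    longest candidate wins, an entity embedded in a longer entity is
    never tagged on its own.
    """
    max_len = max((len(name) for name in entities), default=0)
    pieces, plain, i = [], "", 0
    while i < len(text):
        match = None
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in entities:
                match = text[i:i + length]
                break
        if match is None:
            plain += text[i]
            i += 1
            continue
        if plain:
            pieces.append(plain)
            plain = ""
        pieces.append(f"[{entities[match]} {match}]")
        i += len(match)
    if plain:
        pieces.append(plain)
    return "".join(pieces)


entities = {"Pacific Asia Travel Association": "O", "Asia": "L"}
print(tag_longest("Pacific Asia Travel Association annual meeting", entities))
```

The embedded location "Asia" is not tagged separately, because the organization name that contains it is the longer match.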



3. TIMEX, NUMEX, MEASUREX and ADDRESSX that are
embedded in Person Name, Location Name and
Organization Name are not to be tagged.

[O Jb^Kt] --- Right tag

[O 4b^[int E]^] --- Wrong tag

4. In the case that an Entity expression contains
some strings in both English and Chinese while the
English strings are integrally associated with the
Entity, then the whole expression is tagged as an
Entity.

[O IBMtH^W]]

[O AmericanfL^^W]]



5. In a possessive construction, the possessor and
possessed NE substrings should be tagged separately.
In written Chinese, the designator "的" is a sign
for such a possessive construction.

[L Hl]B[PlS«*$]

Note that the string "的" should be considered as
part of the Entity if it does not function as the
designator.

to mtfjfc^nm

6. Quotation Marks are included in the tag if they
appear within an Entity's name but not if they bound
the Entity's name. In Chinese text, Title Marks are
treated in the same way.

[o «W3Lfin%"&ft£±]

«[o mftHft]) mmm

7. Non-decomposable complex phrase. If a complex 
expression is not an entity as a whole while it 
contains an entity within the expression, then the 
entity within the expression is to be tagged as X P- 

10 ms' , x L-ms' , or x O-ms' . 

If the annotators are not sure whether the
expression is decomposable or not, then the
expression is treated as decomposable, and the Entity
within it is to be tagged. E.g. [L_ms 香港]脚 "Hong
Kong Foot", with the same meaning as athlete's foot.
The expression as a whole is non-decomposable.
According to the guideline, the word 'Hong Kong' can
be tagged as a Location name, 'L_ms'. E.g. [ord
mra+/N] M [oX^mfWMn^] "Forty-sixth
Pacific Asia Travel Association annual meeting"; in
the guideline the expression is treated as
decomposable:

'X^PtMMfrWb^ Pacific Asia Travel Association' is
tagged as an organization, while 'X^ff&Mtfcft
Pacific Asia Travel Association annual meeting' is
not an organization.

For an expression 'Person Name + thought
(or: theory, law, ideology)', the whole expression is
to be tagged as 'P_ms'.

[P_ms ^j£IB] ±j£ "Marx ideology"

[P_ms "Mao Zedong thought" 

[P_ms feW "Avogadro's law" 

8. Treatment of '军' (... army/... military ...). The main
distinction is between interpreting 军 as an
adjective, similar to the English 'military' (i.e.
'not civilian'), and interpreting 军 as an
'organization designator'. To get the latter
interpretation, look for cases in which 军 is preceded
by a service 'branch' designator (such as 空 'air' as
in 'Air Force').

[LH]¥^Jl "U.S. military aircraft" 

[OlffM^^S?] "Sri Lanka air force"

In general, do not tag terms ending in pf>PA
"force" as ORGANIZATION: [LM#] ttfPSPPA "West Africa
peacekeeping force". ^l^Hiti} "military base" is to be
tagged as LOCATION, not ORGANIZATION: [L^#^^¥Sitfe]
"Peterson air military base".

9. If a Named Entity (Person name, Location name,
Organization name) is the title of a kind of
multimedia (TV and radio shows, movies and books), a
product or a treaty, it is to be tagged with the
"-ms" tag.

[P-ms XP/MF]— Jiftj&iij "Deng Xiaoping (CL-for-film)'s
release", i.e. the release of the film "Deng
Xiaoping".

Since '^/.MF Deng Xiaoping' is the title of a TV
program, according to the guideline 'Deng Xiaoping'
is to be tagged as 'P-ms'.

[L_ms rm&&! «[L_ms &2M5«JiliJK 

10. Aliases, Nicknames and Acronyms of Entities are to
be tagged.

[O ETS]
[O IBM]

to itm

If a Named Entity is embedded in the Acronym of an
Entity, then it is not to be tagged. [0 + ^4*;&i&?nM],
'中' means '中国'; no markup for '中'.

Guidelines that pertain only to PERSON

1. Titles of Person 

Titles and role names are not considered part of a
person's name.

[pH^F^IM] S^$P "Albright state minister"
[L^H] ^C3E [PifM^fi] "Queen Elizabeth of England"

However, generational designators "tt£", "ft" are
considered part of a person's name.

[P +Etft^fMif^Pt0] "fourteenth Dalai Tenzin
Gyatso"

[L^H]^:i[P ifS^S— ttt] "England's queen
Elizabeth II"



When a person's title falls between the surname 
and the given name, include the title. 

[P$±Mtl]M "Li Chairman Deng-hui Mister" 

2. Family names are to be tagged as Person.

[P%] R^i 1 "the Jiang family, father and son" 
[Pffiiililifl "the Xidi brothers" 

3. Names of animals are to be tagged as Person. 

4. Saints and other religious figures, the proper 
names are to be tagged as Person. 

[P S1SS] 
[P 

5. Fictional characters are to be tagged as Person. 

6. Fictional animals and non-human characters are to 
be tagged as Person. 

7. When a person's title or dynasty title refers to a 
specific person, then it is tagged as Person. 

[P |ff£] "Kang Xi , i.e. Emperor Kang Xi" 
[P H^fil] "Qin dynasty first emperor" 
[P ] "Laozi"

8. Miscellaneous Personal Non-taggables 

If people's names appear as the titles of
multimedia (TV and radio shows, movies and books), of
products or of treaties, the names are to be tagged
as 'P_ms'.

《[P_ms ffSPM^]》 "Mona Lisa", as the title of a
painting (or title of a book), is to be tagged
"P_ms".

In the following five cases, the proper names
are not to be tagged as Person: laws named after
people, court cases named after people, weather
formations named after people, and diseases and
prizes named after people.

IKA^li --- no tag on 'IT 

^mtfVf ACl^aSc^KiffM no tag on 

[P_ms tag 'iSM^Nobel' as 'P_ms' 



9. Normal pattern of Chinese names 

Generally, a person's name consists of two
parts: Family Name (FN) and Given Name (GN).

# | Name Pattern | How to tag | Example
1 | Family Name only (FN) | Tag FN | [P ^]
2 | Given Name only (GN) | Tag GN | [P&&]
3 | FN + GN | Tag the whole name | [Pi^£]
4 | a. Name (whole name, or GN only, or FN only) + Title; b. Title + Name | Tag name(s) only, i.e. no mark on title | [p^iidS; [pi£&]#Jtff; [P^]fiCjf. Title includes: president, premier, minister, principal, professor, teacher, PhD., researcher, senior engineer, chairman, CEO, etc.
5 | Prefix + Name; Name + Suffix | Tag Name only | *[P$]; [P$]&
6 | Name + Name | Tag the names separately | [P$I&I*] [P^ISJPB]
7 | Foreign name | Tag the whole name | [p^a^ift]; [Ptt^.jft3c]. If the character '.' appears within a Person Name, the name is considered as a whole Entity.



Guidelines that pertain only to LOCATION



The strings that are tagged as LOCATION
include: oceans, continents, countries, provinces,
counties, cities, regions, streets, villages, towns,
airports, military bases, roads, railways, bridges,
rivers, seas, channels, sounds, bays, straits, sand
beaches, lakes, parks, mountains, plains, meadows,
mines, exhibition centers, etc., fictional or
mythical locations, and certain structures, such as
the Eiffel Tower and Lincoln Monument.

[L^rfn [L»SE] [Lftl#S&49^] "Beijing City,
Haidian district, Zhichun road No. 49"

[Lfflftf] "Korea south and north dialogue",
tag on Korea but no tag on south/north

[L^] 7^ "conflict between Arab and Israel",
tag on Israel but no tag on Arab since it does not
refer to a specific country

BU [L f^ni&E "former Yugoslavia area"

"epicenter located at north 36.0 degrees east
95.9 degrees".

1. For a Location entity embedded in another Location
Entity, the whole entity is to be tagged.

[L HH^¥Sife] "America military base", no tag
on America.

Treatment of '...district/...area'. If 地区 means a
specific district, then it is to be tagged as part of
the Location; if 地区 generally means some area, then
it is not to be tagged; if the reference of 地区 is
unclear, then it is not tagged. [L l|fe?jf±feIE]
M1E%%) [L iWffT] "Lin Yi district now changes its name
into Lin Yi city".

For Organization names embedded in location names,
the organization names are not to be tagged.
[L S^JjC^BI] "White House rose garden", no tag on
White House.

2. Locative designators are to be tagged as part of
Location.

[L^M^H] "Maryland state" 

[L #JJ=LM] "Jordan River" 

Compound expressions in which place names are
listed in succession are to be tagged as separate
instances of Location. [L^#it] [L 3ii£$J#£fl£ S ^ jfl ] [L
HHHTtT] "Jilin province Yanbian Korean autonomous
region Tumen municipality".
3. Transnational locative Entity Expressions

[L B^lH^I^A "west Africa country leader"
[L "Asia & Pacific Rim", tagged as one entity
[L W^£}c] Blc "western hemisphere countries"
^M+HIc No markup.

Subnational region names:
[L "South China"
[L MjfciWE] "Northwest five provinces"
UWSjifeE fift^x* "causing the southwest region's
passenger service...", no markup on "southwest" since
it has no fixed reference
[L ^T^nJ&E "South China region", here South China
has a fixed reference.

4. Time modifiers of locative Entity Expressions.
Historic-time modifiers ("former") are not to be
included in tagged expressions.

BU [L l^lJ&E "the former Yugoslavia region" 

5. Space modifiers of Locative Entity Expressions 
[L 4tS^F^l] "North Ireland" 

[L+Mjfify3E] "central Siberia" 

[L 4 1 ] [L ^H] "central and south America"; this
expression contains two Location entities, "central
America" and "south America", so they are to be
tagged separately.



6. Miscellaneous locative non-taggables:
Do not tag the names of locations which are in
language names of the form x-io or x-j£, where x is a
location.

"England language, i.e. English", no tag on '^'
tpjt "China language", no tag on '^'

Do tag the location names of the form x-i$, where x
is a location. ffl [LHJII ] i§ "using Sichuan words", tag
as Location on E3JII.

7. Do not tag location names which are part of the
names, ending in M or H, of ethnic groups.

"the intent was to promote peace and understanding
between Cyprus Greece-ethnic-group and
Turkey-ethnic-group".

In the expressions X ^SP> X ^XM' , and 
are not to be tagged as Location. However, in the 
expressions 

'tp' and i 4 1 ? are to be tagged as Location.



8. Normal pattern of Location 



# | Location pattern | How to tag | Example
1 | Location Name only (LN) | Tag LN | [Llil3fc]
2 | LN + Location Designator | Tag the whole expression | [idtjRHi]
3 | Compound expressions in which place names are listed in succession | Tag separately | [Lillet] [LW^TU] [ > [l±$S]
4 | Alias or nicknames are listed in succession | Tag separately | [L#] , [L K] . [L*] ; [L$tl [L?Hl [Lpl J&E / [l^] [urn mm$m
5 | LN expression contains person name or place name | No tag for the person name or the place name | r T 7±r n*r -i [L R^iffr]
6 | LN + L designator, as a whole to express a complete concept | Tag the expression using maximum matching approach |





Guidelines that pertain only to ORGANIZATION

Proper names that are to be tagged as
Organization include stock exchanges, multinational
organizations, businesses, TV or radio stations,
political parties, religious groups, orchestras,
bands, or musical groups, unions, non-generic
governmental entity names such as "congress" or
"chamber of deputies", sports teams and armies
(unless designated only by country names, which are
tagged as Location), as well as fictional
organizations.
Corporate or organization designators are
considered part of an organization name. A basic
principle for Location tagging is to use the maximum
matching approach.

Iff [o+B«f*tt#?i^tt]ttfe[P if**]
"former China Xinhua News Hong Kong branch
director Xu Jiatun"

[04fc3R*^itJf*lSAXWH$^] "Peking University
Computing Science Department Artificial Intelligence
Lab"



Normal Pattern for Organization 



# | Type | Tag | Example
1 | organization name + designator | Tag as a whole |
2 | place name + organization name | Tag as a whole |
3 | Person name + Organization name | Tag as a whole |
4 | Alias or abbreviation | Tag as a whole |



1. National (or international) legislative bodies
and departments or ministries are to be tagged as
Organization.

[p MKm isi to 
2. Treatment of a Location name immediately preceding
an organization name. Generally there are two types
of relations between the Location and the
Organization: one is possession (such as ft ffltftSIIrL^jlj
"France aviation and space flight bureau"), the other
is the geographic link (such as IkR'X^ "Beijing
University").

2.1 For an Organization Entity beginning with a
location name, if removing the Location would leave an
expression without a specific referent, then the
Location name is to be tagged as part of the
Organization.

[oit'Miz^] "Beijing University"

[085^ + 3*] "Shenzhen middle school"

2.2 For the Organization expressions mentioned above,
if there is one location name (or more than one)
immediately preceding the expression, then the
location name and the Organization expression are to
be tagged separately.

[L [OitMJz^ 1 ] "China Beijing University"

[L +B] [Lr^] [O^lJIlt^] "China Guangdong Province
Shenzhen middle school"

2.3 For an Organization Entity beginning with a
non-location string (such as filfifjz^F 1 "Tongji
University"), if there is one Location (or more than
one) preceding it, then only the Location immediately
preceding it is to be tagged as part of the
Organization.

[0±$|[pl$r^^] "Shanghai Tongji University"

[L tH] [0±^[W|^^:^] "China Shanghai Tongji
University"

[O^fcWS^H^] "Hubei province WuGang No. 3
middle school"

2.4 If an Organization Entity begins with two or more
paratactic locations, then all those locations are to
be tagged as part of the Organization; if there are
other location(s) preceding the whole Organization,
then the location and organization are to be tagged
separately.

[L [Ol*S#^t] "Los Angeles Asia Pacific
laws center"

[L #?t] [0*m%%W£] "Hong Kong, China, Hong Kong
Commercial Association"

2.5 In some complex cases, it is unclear whether the
Organization begins with one location or two; then
tagging should be made according to rules 2.1 and
2.2.

E.g.: mnki$iumx\tfcm& "Los Angeles Taipei
Economics & Culture Office", whether to tag as A: [L
fcftm [osim®xitfrm&] or B: [ommi$Mmx\ifcmm

In this case, tagging A is chosen by default.
2.6 Annotators may not have enough knowledge to
decide whether an organization begins with a
location.

E.g.: in the expression "Epjt/BMM^ B#
annotators are not sure whether
£HltTJ§ is a location name. However, it is
clear that once this string is removed, the remaining
strings have no specific referent. Therefore,
according to 2.1, the expression is to be tagged as:

[l £psmm] to mmmm^n] .

2.7 If a location entity is immediately followed by an
Organization, while there is no modifying relation
between them, then they are to be tagged
separately.

"have promoted the
cooperation between China and Southeast Asia"

"on Geneva UN human
rights conference"
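
Rules 2.1 to 2.3 above can be sketched in code. The following is only an illustrative sketch working on English glosses, not part of the guideline: the function name, the `core_starts_with_location` flag (assumed to be supplied by the annotator) and the `sep` parameter are all hypothetical.

```python
def tag_org_with_locations(locations, org_core, core_starts_with_location, sep=" "):
    """Apply rules 2.1-2.3 to one organization mention (illustrative only).

    locations: location names preceding the organization core, outermost first.
    org_core: the organization string itself, e.g. "Tongji University".
    core_starts_with_location: hypothetical annotator-supplied flag; True when
        org_core already begins with a location that cannot be removed
        without losing the referent (rule 2.1, e.g. "Beijing University").
    """
    outer = list(locations)
    if not core_starts_with_location and outer:
        # Rule 2.3: only the immediately preceding location joins the Organization.
        org_core = outer.pop() + sep + org_core
    # Rule 2.2 (and the rest of 2.3): remaining locations are tagged separately.
    parts = [f"[L {loc}]" for loc in outer] + [f"[O {org_core}]"]
    return sep.join(parts)


print(tag_org_with_locations(["China"], "Beijing University", True))
print(tag_org_with_locations(["China", "Shanghai"], "Tongji University", False))
```

For Chinese text, `sep=""` would be the natural choice, since the source strings contain no spaces.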

3. Phrases ending with "...^" (meeting, conference,
arts festival, athletic competitions) refer to
events, and are not to be tagged as Organization.
However, the institutional structures themselves
(steering committees, etc.) should be tagged as
ORGANIZATION.

"Olympic sports meeting"
[O "Olympic Committee"

If the phrases "...^" refer to "Congress" or
"Chamber of deputies", then they are to be tagged as
Organization. Notice that session meetings of
Congress (or Chamber of deputies) are not to be tagged
as Organization, because they are events.

[o £BMH Am^lX^iX&T

[o KmX±]

4. If the first person pronouns "$£", "$cCI"
function as modifiers preceding an Organization
entity, the pronouns are not to be tagged as part of
the Organization. |£H [O^j 3 ^] "I country Communist Party"
f*Hn [0?f4#;^] "we Tsinghua University".

5. Embassies and Consulates

Names of embassies, consulates and other diplomatic
missions should be marked as Organization only if
both the country they represent and their location
can be included in the markup.

jsmmi [omm%mmm±mm "then transferred to
U.S. stationed at Honduras embassy".

If the Embassy descriptor is contiguous with the
country/district it represents, then the
country/district is to be tagged as part of the
Organization.

frft [L#»] 69 [0&tP*£»r«^f&] "go to Honduras
Embassy in Hong Kong"

If the Embassy descriptor is contiguous with the
geographic location, then mark any locations
separately as Location, and do not tag the embassy
as an Organization.

[L^B]^iiiig±[L^^^]«^P^ifejE^I|^ "U.S. going
through stationed at Kinshasa embassy and other
normal channels".

6. Manufacturer and product

In cases where the manufacturer and the product
are named, the manufacturer is to be tagged as
Organization, while the product is not to be tagged.
Products must be defined loosely to include
manufactured products (e.g. vehicles), as well as
computed products (e.g., stock indexes) and media
products (e.g., television shows).

to mmx^mm "Dow Jones industrial average
index".

7. Do tag news sources (newspapers, radio and TV
stations, and news journals) as Organization. Both
publishers and publications are to be tagged as
Organization. Note that TV stations differ from TV
shows, the latter not being taggable.

[OAK B Jft] M9hf&M^f& "People's Daily overseas
edition page three".

i*H[0 tifen] Jftilftj "this is central station
reporting".



8. Organization-like non-taggables

Generic entity names such as "the government" are
not to be tagged.

[ L+B] j&jfr "China government"

[L ffHSJ&E] tkfif "Xinjiang Autonomy district
government"

[O^B&£SP]tt' "China public safety
department(s)".

Do not mark the term "center" by itself as
an Organization. However, do mark "party
center" as an Organization.

4 + ;&tfr$I#T "under the leadership of the
center".

K[Pfllfffe]ra*^^£?69[0 ft + ifeliSB "party center,
with comrade Jiang Zemin as its nucleus".

Do not tag "exchange fair" as Organization.

[L 4* EH ] [L i±jni% 0 a n5C^# "China Tianjin
exported commodity exchange fair".

9. Tag on several special named entities. 
[L AKiA^S] "the Great Wall" 

[O &?jf] "White House" 

[0 J&!&MWt1=£]%7F "Kremlin says" 

How to tag TIMEX
The TIME sub-type is defined as a temporal unit
shorter than a full day, such as "second, minute, or
hour". The DATE sub-type is a temporal unit of a full
day or longer, such as "May, week, month, quarter,
year(s), century, etc." The DURATION sub-type
captures durations of time.

1. DATE

For strings of the form HU/^s/T + duration, the entire
phrase is tagged as dat_MET; because the duration is
embedded in the DATE, it is not to be tagged.

[dat_MET fj3^c] "the first three days"
[dat $c^]fitfi- "autumn report"
[dat BH^S] "the fourth quarter"
[dat -h£tftl£] "the fifteenth century"
[dat #Tf] "the spring Festival"

Note that the strings "( ±/4VT )^ the first/
second/last ten days of one month" are to be tagged:
[dat£^± / 0)] "the last ten days of May".

Words or phrases modifying the expressions, such as
'around' or 'about', are not to be tagged.

^^[datS^KS] "around May 4th"

2. TIME

[tim ^M^E^^t 1 ] "three to four o'clock in the
morning"

[tim 4tJ5lHrffi3j5Ht59#] "Beijing time 5 hour fifty
nine minutes"

[tim_MET ±^f], [tim_MET ^ *F], [tim_MET T^p], [tim_MET B^±]
"morning, noon, afternoon, evening"

Treatment of "^I^J about/around"

[tim Wfc±M*)-t&] lijii "in the evening about 7
hours arrive"

In this phrase, the string 'about' is bounded by two
Times and it is non-decomposable, so it is to be
tagged.

[dat AlfHH] [tinrt*] IHifclfcjR "September
13th about seven o'clock arrive in Beijing".

In this phrase, the string is bounded by a date and
a time, so it is decomposable.
3. DURATION

[dur 10^c] "10 days"

£7jcHM [dur mfrz-mz] zmmmm "in the
quarter century of discussions since the Watergate
scandal..."

The string "§S" is not to be included in the Duration
tag, because including it or not makes little
difference.

[dur "exactly fifteen years"

[dur "exactly at 9 o'clock
arrive at Beijing station"

+ "nine years drought in ten years, i.e.
often suffering drought", no markup on 'nine' and
'ten', because they are both virtual numbers in this
case.

4. Non-taggable:
The time expressions that do not have an absolute time
scale, such as "just now, recently, since
negotiation, a moment", are not to be tagged.
If a festival expression does not have
an absolute time, then it is not to be tagged.
[L ^J8] Bl^feS^ "India international film
festival"

[L +H]^#¥ "Year of China Tourism, referring to
1997"

[LUB] ftJ&ilB "U.S. Independence Day", no
markup for Independence Day because of its close
connection with an event.

Do not tag the # "spring" in #|£ "Spring
couplets".

5. Special Case:

If two time expressions are in different
sub-types, then they are to be tagged separately. If
the two expressions are non-decomposable, then they
are to be tagged together.

[dat 2^12 0] [tim ±^8,£] "Feb. 12 am 8 o'clock"
[dat Mffi — ] [tim 8,£] "Monday 8 o'clock"

If a location entity is embedded in a time
expression, the mark '_MET' is introduced to refer to
the MET-2 guideline. "ER99" can be used to tag
according to an alternative specification.

[tirtdb3RB5TlRll997^2^9^19j^284M

The expressions such as "last year",
"yesterday", "this morning" are to be tagged
according to MET-2; annotators should pay attention to
the difference and use the extra mark accordingly.

[dat_MET ££[dat_ER99 ±4*¥] ]

[dat_MET 4¥ [dat_ER99 Jt^] ]

[dat_MET 4*¥[dat_ER99 H jl — 0 ] ]
[dat_MET 4"¥[dat_ER99 4^170]] [tim_MET T^F]
[dat_MET [dat_ER99# X£3C] ]

[tim_MET H^^[tim_ER99 ]
[dat_MET [tim_MET B&±]

[tim_MET [tim_ER99 A*]]

[tim -?-±7n^]

[dat_MET 3 0] [tim_MET T^F]
[dat_MET 3 0 ] [tim T^F 1 6 Btf 3 0 #]

#0 [tim_MET [tim_ER99 ±^F 1 1 Bt] M [tim_ER99
^&3B^]]

[tim_MET MH&n [tim_MET B&] g

For the expression '4"-?- this morning', ER-99
treats it as a relative time entity that is not to be
tagged, while in MET-2 the relative time is to be
tagged.

[dur_ER99 [dat_MET [dat_ER99 11^2 4] S
[dat_ER99 2 7 0]]]

[dat_MET [dat_ER99 11 M 2 4] M [dat_ER99 2 7 0]]
[tim_MET B£i£]
i£ [tim_MET ^]

[tim_MET 4] B

For the expression "Ifc^ quite a few years", ER-99
treats it as a fixed time duration that is to be
tagged, while "^7^ many years" is a non-fixed duration
and is not to be tagged.

The expression "^ one year" is to be tagged
as Duration.
Affi [dur 8*1*01.
SlftftTffT [dur -¥] 





[mon 900^*^75] WTO 

The expression «4$^ each year''/ "^annual, 
yearly" #¥ltftA 
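
The nested bracket notation used in the examples above (an outer `_MET` tag wrapping inner `_ER99` tags) can be read mechanically. The following is a minimal sketch of such a reader, assuming that each tag name is followed by whitespace and that brackets are balanced; the `Node`, `parse`, `text_of` and `entities` names are hypothetical and are not defined by the guideline.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # mix of str and Node


def parse(text):
    """Parse '[tag content [tag content]]' annotations into a tree."""
    root = Node("ROOT")
    stack = [root]
    buf = []

    def flush():
        # Move any pending plain text into the node currently being built.
        if buf:
            stack[-1].children.append("".join(buf))
            buf.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "[":
            flush()
            j = i + 1
            while j < len(text) and not text[j].isspace() and text[j] not in "[]":
                j += 1
            node = Node(text[i + 1:j])
            stack[-1].children.append(node)
            stack.append(node)
            # Skip the single whitespace that separates the tag from its content.
            i = j + 1 if j < len(text) and text[j].isspace() else j
        elif ch == "]":
            flush()
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            stack.pop()
            i += 1
        else:
            buf.append(ch)
            i += 1
    flush()
    if len(stack) != 1:
        raise ValueError("unclosed tag")
    return root


def text_of(node):
    """Plain text covered by a node, with all markup removed."""
    return "".join(c if isinstance(c, str) else text_of(c) for c in node.children)


def entities(node):
    """Yield (tag, text) for every tagged span, outer spans first."""
    for child in node.children:
        if isinstance(child, Node):
            yield child.tag, text_of(child)
            yield from entities(child)


tree = parse("[dat_MET this [dat_ER99 May 1]] [tim_MET pm]")
print(list(entities(tree)))
```

Filtering the result on the tag suffix (for example `tag.endswith("_ER99")`) would give the view of the same text under one of the two specifications.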

How to tag NUMEX

1. Percentage

[per U^^H-h^l] "thirty nine percent"

[per 5%] "about five percent"
[per ^L^] "ninety percent"

2. Money

[mon E^jfrTi^] "forty five thousand Yuan money"

[mon KTJil^FAKi rft] "forty five thousand RMB"

[mon AKrpK^£T7n] "RMB forty five thousand Yuan"

• In the case that the same amount of money is given
in different currencies, the amounts are to be tagged
separately. The location name embedded in Money is
not to be tagged.

[mon 43.6jZ i H7n] "43.6 billion USD"

• The string "^J about" does not have an absolute
concept, so it is not to be tagged.

#J [mon ~f~*7J7U] "about one hundred thousand Yuan"
[mon $90,000] "more than $90000"

• The string "Jl several" can be replaced by a definite
number to express an absolute amount, so it is
to be tagged.

[mon JL+^TC] "several hundred thousand Yuan"

• The string "over" is not to be tagged generally;
in the following case it is tagged because the
entire expression is non-decomposable.

[mon H^-^Tl^jt] "twenty-seven hundred thousand
over Yuan"

• In this guideline, for a location name embedded in
a currency, if it is spelled with an abbreviation then
it is not tagged, otherwise it is to be tagged as
'-ms'.

[mon 2000§T7G] "2000 SID"

[mon 2000 [L_ms ,ff^JP^3 %} "2000 Singapore
Dollars Yuan".
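
The modifier rules for Money above (approximators such as "about" stay outside the tag, while "several" stays inside) can be sketched on the English glosses as follows. This is only an illustration: the `APPROXIMATORS` list and the `tag_money` function are assumptions, and a real tagger would operate on the Chinese strings.

```python
# Illustrative list of modifiers that are left outside the [mon ...] tag.
# "about" and "more than" come from the glosses above; "around" is an assumption.
APPROXIMATORS = ("about", "around", "more than")


def tag_money(phrase):
    """Wrap a money phrase in [mon ...], leaving a leading approximator outside."""
    lowered = phrase.lower()
    for word in APPROXIMATORS:
        if lowered.startswith(word + " "):
            amount = phrase[len(word) + 1:]
            return f"{phrase[:len(word)]} [mon {amount}]"
    # No approximator: the whole phrase (including "several") is the amount.
    return f"[mon {phrase}]"


print(tag_money("about one hundred thousand Yuan"))
print(tag_money("several hundred thousand Yuan"))
```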

3. Frequency/Integer/Fraction/Decimal/Ordinal
[fre 26#C]
[fre
[fre

[fra 3/4]
[fra H^H]
[fra STlfrZA]
[fra W^^^H^Tn + E]
[fra
[fra 4jg^]
[dec 3.14]
[dec =J& — m]

[ord m~i ^
[ord 1174-^]
[ord 6JJ&]'K$
[ord
[ord H—]^
[int 20
[int ji73] AR
[int Jl=f-7J&]

If the integer/fraction/decimal has a number
unit as a modifier, then the number unit is to be
tagged.

[int J\iM] If "several 'jia' factories"
—3^ [int 5P] A "one family with five 'kou' persons"
[int 58^] "58 times".
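
For numerals written in ASCII digits, as in [per 5%], [fra 3/4] and [dec 3.14] above, the choice of tag can be sketched with a few regular expressions. This is a simplified illustration only: the `numex_type` function is hypothetical, and Chinese numerals, ordinals, frequencies and number units are deliberately not handled.

```python
import re


def numex_type(token):
    """Return the NUMEX tag for an ASCII-digit token, or None if unrecognized."""
    if re.fullmatch(r"\d+(?:\.\d+)?%", token):
        return "per"   # percentage, e.g. 5%
    if re.fullmatch(r"\d+/\d+", token):
        return "fra"   # fraction, e.g. 3/4
    if re.fullmatch(r"\d+\.\d+", token):
        return "dec"   # decimal, e.g. 3.14
    if re.fullmatch(r"\d+", token):
        return "int"   # integer, e.g. 20
    return None


for tok in ["5%", "3/4", "3.14", "20"]:
    print(f"[{numex_type(tok)} {tok}]")
```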

4. Special case

• The tab numbers are not to be tagged.

1. SWill.
2. mm^mmo
3. mmw%m%£mo

• Numbers in some idioms, such as "— ^JL one moment",
"— *3 together", "— $L first level", "If— only one",
etc., are not to be tagged.

• Numbers embedded in Person name, Location name or
Organization name are not to be tagged.

[O — 4»] "No. 1 middle school"
[LHB^rfr] "San Ming city"
ftBMfclft [0 1205&#BM

• If the string "—" functions as the article 'a', then
it is not to be tagged. "— f£ one time over" is to be
tagged. As a part of an ordinal number, "—" is
to be tagged.

— ^^r|5" "a city"

§ ^CBiJ^ik^I — "one of the biggest companies"

[ord — #] % "the first prize"

W^K^tfo [int — fg] "my income is one
time over his".
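
The rule that numbers embedded in a Person, Location or Organization name are not tagged can be applied as a clean-up pass over an already annotated string. The sketch below assumes the bracket notation used in this guideline with whitespace after each tag name; the `strip_embedded_numbers` function and the English example strings are hypothetical.

```python
import re

NAME_TAGS = ("P", "L", "O")
# Innermost NUMEX-style tags, i.e. ones containing no further brackets.
INNER_TAG = re.compile(r"\[(?:int|ord|fra|dec|per|mon|fre)\s+([^\[\]]*)\]")


def strip_embedded_numbers(annotation):
    """Remove NUMEX tags nested inside a single P/L/O annotation."""
    m = re.fullmatch(r"\[(\w+)\s+(.*)\]", annotation, flags=re.S)
    if not m or m.group(1).split("_")[0] not in NAME_TAGS:
        return annotation  # not a name entity: leave untouched
    body = m.group(2)
    while True:
        # Repeat so that numeric tags nested in numeric tags are also removed.
        new_body = INNER_TAG.sub(r"\1", body)
        if new_body == body:
            break
        body = new_body
    return f"[{m.group(1)} {body}]"


print(strip_embedded_numbers("[O Beijing No. [int 3] Middle School]"))
```

The input is expected to be exactly one outer annotation; a sentence with several entities would first need to be split, for instance with a bracket-aware parser.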



How to tag MEASUREX

MEASUREX includes: Age, Weight, Length,
Temperature, Angle, Area, Capacity, Speed and Rate.

[age 34^]
[age /n+^JR]
[age ftEp]£A
Fi&m [wei ffcTTM]
xmm [len -*A-b] mmm
m [len 5#] ft [len
*R?fift( [tem 2800S] )
nnm^±=f [ang 90S] ftfc
&m [are 207Th']
313#jl;*7 [cap 34^3Ljj]
— [capW^"]
MmMS. [spe 360*##]
[wei — -h77 3 *RGJlh
[tem #T5] S!| [tem 6JIRjf ]

Note that the other units of weights and
measures in Physics and Chemistry are to be
tagged as "mea".

[mea 5.5H#] "5.5 watt"

[mea 1.5 ^M] "1.5 Newton"

How to tag ADDRESSX
ADDRESSX includes: Email, Phone, Fax, Telex, WWW.
[ema exp@email.com.cn]
Tel: [pho 86-10-66665555]

[pho 86-10-66665555]
FAX: [fax 86-10-66665555]
TELEX: [tel 86-10-66665555]

[www http://www.hotmail.com]

A telephone or fax number is to be tagged
only if there is a designator such as "tel, %iS".
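
The designator requirement for telephone and fax numbers lends itself to a pattern-based sketch. The code below is an illustration under stated assumptions, not the tagger of the invention: the regular expressions, the `TAG_FOR` mapping and the `tag_addresses` function are hypothetical, only the English designators "Tel", "Fax" and "Telex" are recognized, and numbers are assumed to be hyphen-separated digit groups as in the examples above.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A designator, a colon (ASCII or full-width), then hyphen-separated digit groups.
DESIGNATED = re.compile(r"\b(tel|fax|telex)\s*[:：]\s*(\d+(?:-\d+)+)", re.IGNORECASE)
# Tag names follow the examples above: Tel -> pho, FAX -> fax, TELEX -> tel.
TAG_FOR = {"tel": "pho", "fax": "fax", "telex": "tel"}


def tag_addresses(text):
    """Tag e-mail addresses, and numbers only when a designator precedes them."""
    text = EMAIL.sub(lambda m: f"[ema {m.group(0)}]", text)

    def repl(m):
        # Keep the designator and separator exactly as written.
        prefix = m.group(0)[: m.start(2) - m.start(0)]
        return f"{prefix}[{TAG_FOR[m.group(1).lower()]} {m.group(2)}]"

    return DESIGNATED.sub(repl, text)


print(tag_addresses("Tel: 86-10-66665555"))
print(tag_addresses("call 86-10-66665555"))
```

A number without a designator, as in the second call, is left untagged.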

Although the present invention has been
described with reference to particular embodiments,
workers skilled in the art will recognize that
changes may be made in form and detail without
departing from the spirit and scope of the invention.



